1 - Introduction and Dataset description
2 - Data Exploration
The analyzed dataset contains information about mobile phones: each sample is described by 20 features (or predictors) plus a label that indicates the phone's price range.
The dataset includes a total of 2000 labelled samples.
The price range is an integer in the range $[0, 3]$:
The dataset can be found at https://www.kaggle.com/iabhishekofficial/mobile-price-classification#train.csv
I have chosen Python 3.7 as the programming language because it is very intuitive and one of the most popular in the Machine Learning world. Python also offers many powerful libraries that are plug-and-play and easy to use. Compared to the R language, I find Python much simpler and more intuitive, which is why I have used it.
I wrote the code in a Jupyter Notebook, which I found very useful since it allows mixing code cells with markdown/text cells. On top of that, it offers good visualization and graphic tools.
In this section I have analyzed the dataset, pointing out the information I considered relevant. To do this, I used boxplots, a heatmap and other measures.
As I said before, each mobile phone is described by 20 numerical features that include not only technical details about its hardware components but also values corresponding to the performance of the device. In more detail, the features are:
In summary, the dataset has a total of 14 numeric variables and 6 Boolean variables that describe the characteristics of each mobile phone.
There are no NULL/missing values in the dataset.
The table above shows some statistical measures about each numerical predictor variable of the dataset:
For example, considering the battery_power variable:
- The average value is 1238.51 mAh;
- Its standard deviation is 439.42;
- The minimum value in the whole dataset is 501 mAh while the Maximum value is 1998 mAh;
- The 25th percentile is at 851.75 mAh, meaning that 25% of the samples fall in the range $[501, 851.75]$ mAh;
- The same holds for the 50th and 75th percentiles.
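These statistics come straight from pandas. The sketch below uses a synthetic stand-in for the battery_power column (the real values would come from `pd.read_csv("train.csv")`):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Kaggle battery_power column (501-1998 mAh).
rng = np.random.default_rng(0)
df = pd.DataFrame({"battery_power": rng.integers(501, 1999, size=2000)})

# describe() reports count, mean, std, min, 25%, 50%, 75% and max at once.
stats = df.describe()
q25 = df["battery_power"].quantile(0.25)   # same value as the "25%" row
```
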
The dataset presents numerical features with very different units of measure: for example, "battery power" is three orders of magnitude larger than "clock speed", "depth" is three orders of magnitude smaller than "px_width" and "ram", and so on.
For this reason I have decided to normalize the values of the dataset, in order to smooth out the scale differences between the features.
The dataset is perfectly balanced: it includes 500 samples for each price category, for a total of 2000 samples. Given this already balanced structure, no balancing technique is needed.
In general, a boxplot shows in a simple way the distribution of values assumed by some data. It consists of a box and a pair of whiskers: the box marks the range where the majority of values lie (from the 25th to the 75th percentile), while the whiskers mark the 5th and 95th percentiles respectively. Moreover, the 50th percentile coincides with the median and is marked with a dash in the figure (recall that the median is the central value, which splits the data into two halves of equal size).
In the following plots I have displayed, for each numerical feature, the relative boxplot highlighting its value distribution.
There are no outliers except for the 'pixel_height' and 'front_camera' variables. The boxplot marks as outliers the points that exceed the whiskers, i.e. the 5th and 95th percentiles.
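Such per-feature boxplots with whiskers at the 5th/95th percentiles can be sketched as follows (synthetic stand-in data; the real plots would loop over all numeric columns of the dataframe):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-ins for two of the numeric columns.
data = {"battery_power": rng.uniform(501, 1998, 2000),
        "clock_speed": rng.uniform(0.5, 3.0, 2000)}

fig, axes = plt.subplots(1, len(data), figsize=(8, 3))
for ax, (name, values) in zip(axes, data.items()):
    ax.boxplot(values, whis=(5, 95))   # whiskers at the 5th/95th percentiles
    ax.set_title(name)
fig.tight_layout()
```
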
A small subset of the features present an asymmetric value distribution; they are:
Their mean does not coincide with their median and they have a high variance: the points are likely to be quite far from the expected value. Their distribution is not centered on the mean but is shifted higher or lower.
I have plotted the distribution of values assumed by each numerical feature separately for each price category by means of a boxplot.
These box plots show the values assumed by each feature, separately for each price category, so the value distribution of each feature can be compared across all the classes.
In summary, good features show a positive, growing trend from class 0 (LOW) to class 3 (VERY-HIGH); this is a reasonable consequence of the fact that devices with high-quality specs are more expensive than devices with low-quality specs.
I have plotted the correlation matrix of the features of the dataset. The matrix is symmetric and shows the degree of correlation between each pair of variables. I used the Pearson correlation coefficient, a value that describes the mutual variability between two variables, to measure the relationship between the features.
It is defined by:
$ \ \rho _{{XY}}={\frac {\sigma _{{XY}}}{\sigma _{X}\sigma _{Y}}} $
Where $\ \sigma _{{XY}} $ is the covariance of $ X, Y $ and $ \ \sigma _{X},\sigma _{Y}$ are the two standard deviations.
Such correlation index is a number in the range $[-1 , 1]$:
The first thing that stands out is the high correlation between the RAM and the price category: high RAM values are positively correlated with a high price range. This seems reasonable, since expensive smartphones usually have a lot of RAM available.
The pixel resolution also shows a slight correlation with the price category, which mirrors reality: top-level phones usually have a high pixel density and therefore a powerful (and clearly costly) display. Furthermore, pixel resolution height and width are positively correlated: if the height resolution increases, the width increases proportionally as well, which seems reasonable.
The screen dimensions (width and height, measured in centimeters) are positively correlated; this makes sense since the standard device shape is rectangular, so the height cannot grow while the width remains unaltered.
Other significant relationships are the one that links 3G and 4G technologies and the one that exists between primary_cam and front_cam features.
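A Pearson correlation matrix like the one discussed above is one pandas call. A minimal sketch on synthetic data, where a `price_range` column is deliberately driven by `ram` so the correlation pattern resembles the one observed:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
ram = rng.uniform(256, 4000, 500)
price = ram * 0.001 + rng.normal(0, 0.3, 500)   # price loosely driven by ram
clock = rng.uniform(0.5, 3.0, 500)              # unrelated feature

df = pd.DataFrame({"ram": ram, "price_range": price, "clock_speed": clock})

# Symmetric matrix of Pearson coefficients in [-1, 1]; the diagonal is 1.
corr = df.corr(method="pearson")
# seaborn's heatmap(corr, annot=True) would render it as in the report.
```
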
Since the screen size is described by two variables (screen_height and screen_width), I chose to merge them into a single variable: the diagonal of the screen. To do that, I simply applied the Pythagorean theorem and calculated the diagonal from the height and the width (which is possible thanks to the rectangular shape of mobile phones).
This way I have reduced by one the dimensionality of the dataset (20 --> 19 features).
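The merge is a one-liner. A sketch with hypothetical column names (`sc_h`/`sc_w` are the names used in the Kaggle file; adjust if yours differ):

```python
import numpy as np
import pandas as pd

# Two hypothetical phones with screen height/width in cm.
df = pd.DataFrame({"sc_h": [12.0, 3.0], "sc_w": [5.0, 4.0]})

# Pythagoras: diagonal = sqrt(h^2 + w^2), then drop the two originals.
df["sc_diagonal"] = np.hypot(df["sc_h"], df["sc_w"])
df = df.drop(columns=["sc_h", "sc_w"])
```
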
Principal Component Analysis is an unsupervised dimensionality reduction algorithm that summarizes a large feature set into a smaller number of features that together explain as much variability of the original dataset as possible. In other words, PCA finds a lower-dimensional representation of the data that keeps as much information as possible.
The idea behind PCA is that not all the dimensions of the original dataset are equally interesting; PCA simply looks for a small number of features that are as interesting as possible and discards the others, where "interesting" means the amount of variance explained.
The so-called principal components are nothing but the directions in space that contain the maximum variance of the considered dataset; PCA finds a number of PCs (directions) smaller than the number of original features and then projects the data onto the new lower-dimensional subspace. These directions are orthogonal to each other and as close as possible to the cloud of data.
More in detail, the first PC of a set of features ${X_1,X_2, \dots , X_p}$, is the normalized linear combination of the features: $Z_1 = \beta_{{11}}X_1 + \beta_{{21}}X_2 + \dots + \beta_{{p1}}X_p $ that has the largest variance. The beta coefficients are the loadings of the first PC and together they constitute the PC Loading vector $\beta_1 = \{\beta_{{11}}, \beta_{{21}}, \dots , \beta_{{p1}}\}^T $.
Normalized means that $\sum_{j=1}^{p} \beta_{{j1}}^2 = 1$
The second PC $ Z_2 = \beta_{{12}}X_1 + \beta_{{22}}X_2 + \dots + \beta_{{p2}}X_p $ is the linear combination of ${X_1,X_2, \dots , X_p}$ that has maximal variance in the residual space, that is, among all linear combinations uncorrelated with $ Z_1 $; and so on for all the other PCs.
Recall that constraining $ Z_2 $ to be uncorrelated with $ Z_1 $ is equivalent to constraining the direction $ \beta_2 $ to be orthogonal to the direction $ \beta_1 $.
PCA is unsupervised because there is no associated response (Y) needed.
PCA is also a great tool for data visualization: by selecting the first two PCs, which are the most informative and representative, you can produce a scatter plot of a much higher-dimensional dataset that could not be visualized otherwise.
If you want to apply PCA and reduce the dimensionality of a dataset $D$ with $n$ features to a new one with only $k$ features, these steps are required:
$D_{transformed} = D*W$
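The steps above can be sketched with scikit-learn, which centers the data and then applies the projection $D_{transformed} = D \cdot W$ internally; the manual computation below verifies the equivalence (synthetic data, hypothetical shapes):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
D = rng.normal(size=(200, 6))          # toy data: n=200 samples, p=6 features
k = 2                                  # target dimensionality

pca = PCA(n_components=k)
D_transformed = pca.fit_transform(D)

# W's columns are the k loading vectors (principal directions).
W = pca.components_.T
manual = (D - D.mean(axis=0)) @ W      # center, then project: same result
```
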
We can see from the plots above that the first 4 PCs explain 100% of the total variance of the data; hence, from the 5th PC on, the variance explained is zero. I can therefore reduce the dimensionality of my dataset to 4 features, coinciding with the first 4 PCs.
Hence, the size of the dataset is reduced from $2000 \times 12$ to $2000 \times 4$.
The following table shows the pruned dataset.
In the following plot I represent the data in 2D using the PCA result: I selected the first two principal components, which are the most informative and explain about 83% of the total variance, and I displayed a scatter plot showing the class distributions with different colors.
The dataset is reduced from $2000 \times 20$ to $2000 \times 2$.
The class distributions overlap a little, but each class can be easily distinguished; points belonging to low price categories tend to lie on the left side of the picture, while high price category points lie on the right side.
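A 2D scatter of this kind can be sketched as below, on a synthetic 4-class dataset standing in for the phone data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted runs
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic 4-class stand-in for the (normalized) phone dataset.
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

Z = PCA(n_components=2).fit_transform(X)   # keep only the first two PCs

fig, ax = plt.subplots()
ax.scatter(Z[:, 0], Z[:, 1], c=y, cmap="viridis", s=10)  # color = class
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
```
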
As explained in $3.2$, each PC is a normalized linear combination of the original features. The coefficients $\beta_i$ grouped together constitute the PC$i$ loading vector: it defines a direction in the feature space along which the data vary the most. Of course, each PC after the first (and most informative) has a loading vector pointing in a direction of the residual feature space that is orthogonal to all the other PC directions.
Below, I display the values of the loading vectors for each PC.
The figure below shows the scatter plot of the first two Principal Components and the first two PC loading vectors (Blue lines).
As the plot shows, the first loading vector places most of its weight on Ram and zero on all the others, therefore this component represents the Ram-level of the device.
I reckon that the second loading vector has approximately the same weight on Screen_height and Screen_width and much less on the others. Hence, this component corresponds to a measure of the physical screen dimensions of the mobile phone.
Since there is a huge scale difference in the loading vectors, I have decided to zoom into the center of the previous plot to better visualize the overlapping loading vectors of the other features.
As the section title says, I have exploited the Random Forest classifier, fitted on the data, to extract a measure of "importance" for each feature: the algorithm assigns an importance value to each feature, and the higher the value, the more important the algorithm considers that feature for the classification task.
In scikit-learn, the importance coefficient is called “gini importance” or “mean decrease impurity” and is defined as the total decrease in node impurity weighted by the probability of reaching that node (which is approximated by the proportion of samples reaching that node) averaged over all trees of the ensemble.
More in detail:
The importance of a variable $X_m$ for predicting $y$ can be evaluated by adding up the weighted impurity decreases $p(t)\Delta i(s_t , t)$ for all nodes $t$ where $X_m$ is used, averaged over all $N_T$ trees in the forest.
Impurity formula:
$Imp(X_m) = \frac{1}{N_T} \sum_T \sum_{t \in T : v(s_t) = X_m} p(t)\Delta i(s_t , t)$
where $p(t)$ is the proportion $\frac{N_t}{N}$ of samples reaching node $t$ and $v(s_t)$ is the variable used in split $s_t$.
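In scikit-learn this mean-decrease-in-impurity measure is exposed directly as `feature_importances_`. A sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in dataset: 8 features, only 3 of them informative.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Gini importance: mean decrease in impurity, one value per feature,
# normalized so the values sum to 1.
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]   # most important feature first
```
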
Steps followed:
In section 4.2 - Random Forest I will explain the Random Forest classifier in more detail, together with the Gini impurity index.
The plot above shows the trend of the model's performance as I progressively add one feature at each iteration. At first, when I add the most important features according to the Random Forest Gini importance, the F-Measure grows and the model becomes more effective at recognizing the classes, but from the 5th feature on, the performance slowly decreases, even if in a non-linear way.
Therefore I think a reasonable choice is to keep the first 5 features added, as they are a good trade-off between the number of features kept and the performance obtained. The resulting pruned dataset will have a shape of (2000, 5), keeping the 5 most important features according to the Random Forest metric.
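The add-one-feature-at-a-time experiment can be sketched like this (synthetic data; the real version would use the importance ranking from the forest fitted on the phone dataset and the F-Measure as the score):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)

# Rank features once by Gini importance, best first.
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
order = np.argsort(rf.feature_importances_)[::-1]

# Re-evaluate the model keeping only the k most important features.
scores = []
for k in range(1, len(order) + 1):
    subset = X[:, order[:k]]
    f1 = cross_val_score(
        RandomForestClassifier(n_estimators=50, random_state=0),
        subset, y, cv=5, scoring="f1_macro").mean()
    scores.append(f1)
```
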
I know there are many other ways to combine the features which might give better results, but I considered an exhaustive exploration of the entire solution space out of the scope of this analysis.
In this section I have explained the working principles of some of the classification algorithms we have studied in the lectures and I have applied them on the "cleaned" dataset, trying to understand what algorithm works better in classifying a sample in the correct price category.
For each classifier employed I have followed these steps:
As a preliminary step, I have searched for the best hyperparameter setting of each algorithm: I have trained several models, each with a different parameter value, and picked the setting that gives the best outcome according to the F-Measure.
For each algorithm I have split the dataset into train and test set with proportions 70% training and 30% test: the training one is employed to build the model while the test one to evaluate its performance.
I have chosen to keep track of the accuracy, precision, recall and F-Measure of the model.
I have performed a K-fold cross-validation on the training set with K = 10 in order to create a robust model that is as little biased as possible. Once the model is trained, I evaluate its performance once more on the test set I set apart before.
I repeated steps 2 and 3 using a dataset reduced with one of the techniques described before (PCA and Random Forest-based feature selection).
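The steps above (70/30 split, hyperparameter search with 10-fold CV, final evaluation on the held-out test set) can be sketched as follows, using KNN as the example classifier on synthetic 4-class data:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class stand-in for the phone dataset.
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

# 70/30 split, stratified so each price class keeps its proportion.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

# Hyperparameter search with 10-fold CV, on the training portion only.
search = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": [3, 5, 15]},
                      cv=10, scoring="f1_macro").fit(X_tr, y_tr)

# Final F-Measure of the best setting on the untouched test set.
f1_test = f1_score(y_te, search.predict(X_te), average="macro")
```
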
The classification algorithms that I have implemented are:
- Decision Tree
- Random Forest
- Support Vector Machine
- K Nearest Neighbor
The confusion matrix is a matrix that helps in describing the performance of a classification algorithm on a test set for which the true labels are known. It gives a summary of the classifier predictions by showing the count of both correct and wrong predictions separately for each class.
On the columns we find the ground truth (correct labels) while on the rows we find the model's predictions.
For a simple binary classification, there are four main classes of predictions:
It is defined by the relation: $\frac{TP + TN}{TP + TN + FP + FN}$
It gives an estimate of how many errors the model has made, assuming an equal cost for each error.
It is defined by the relation: $\frac{TP}{TP + FP}$
It tells how many samples that the model classified to a certain class effectively belong to that class. It gives an estimate of the precision of the decisions the model makes.
It is defined by the relation: $\frac{TP}{TP + FN}$
The recall can be defined as the ratio of the total number of correctly classified positive examples divided by the total number of positive examples. A high value of Recall indicates how much the considered class is correctly recognized by the model. In other words, it tells how many samples of a certain class have been correctly classified to that class.
It is defined by the relation: $\frac{2*Recall*Precision}{Recall+Precision}$
It is a sort of trade-off that takes into account both Recall and Precision; it measures how good the model is at recognizing a certain class and how precise its decisions are.
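All of these measures can be computed with scikit-learn on a tiny binary example (note that sklearn's confusion matrix puts the ground truth on the rows and the predictions on the columns, the transpose of the convention used above):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [0, 0, 1, 1, 1, 0]    # tiny hand-made binary example
y_pred = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)    # rows: truth, columns: prediction
acc  = accuracy_score(y_true, y_pred)    # (TP+TN)/(TP+TN+FP+FN)
prec = precision_score(y_true, y_pred)   # TP/(TP+FP)
rec  = recall_score(y_true, y_pred)      # TP/(TP+FN)
f1   = f1_score(y_true, y_pred)          # 2*P*R/(P+R)
```

Here TP=2, TN=2, FP=1, FN=1, so accuracy is 4/6 and precision, recall and F1 are all 2/3.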
The K-fold cross-validation technique consists in dividing the training set into K distinct folds, all containing the same number of samples; then, at each step of the training phase, the $k$-th portion becomes the validation set while all the other parts together constitute the training set.
This process is repeated K times, selecting each time a different portion as validation set.
At each iteration the model is updated and some statistics about its performance are computed.
At the end of the training process I calculated the average of the above-mentioned measures, obtaining a score of how well the model performed on the training set.
Notice that cross-validation is much more robust than a simple split into train and test sets: through many iterations, it creates a number of different samples of the dataset to be used as training sets, which avoids providing the model with a biased portion of data that may not be a good representation of the dataset. In other words, providing the model with K different training sets allows it to better understand the characteristics of the data, improving its classification capabilities.
A decision tree is a flowchart-like tree structure where:
Tree splits can be binary or multiple and the depth of the tree is limited. A decision tree represents a procedure for classifying categorical data based on their attributes: a new test sample goes down the tree, starting from the root and following the branches towards a leaf, taking a series of decisions based on the values of the sample's attributes.
Algorithms for constructing decision trees usually work top-down, choosing at each step the variable that best splits the set of items; therefore, starting from the entire dataset, at each step the best split attribute is chosen and a new node is created. The best split is the one that produces two (or more) output subsets whose samples belong, as much as possible, to the same class. The more a split on an attribute separates the classes, the better it is. The process terminates when a certain depth or a certain degree of purity is reached.
In order to decide the best attribute to use to split the data, I used the Gini index, which measures the degree of purity (or level of disorder) of the splits generated downstream of a node $t$.
The Gini index of the node $t$ is defined by the relation: $GINI(t) = 1 - \sum_{j} [p(j \mid t)]^2$
where $ p(j \mid t) $ is the relative frequency of class $j$ at node $t$.
A Gini value of zero corresponds to perfect separation, where one subset contains all the samples of Class #1 and the other contains all the samples of Class #2. The worst possible separation has a Gini of 0.5, which is the case where the samples in the subsets are mixed 50/50.
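The Gini index of a node is a few lines of NumPy; the two extreme cases above serve as a sanity check:

```python
import numpy as np

def gini(labels):
    """GINI(t) = 1 - sum_j p(j|t)^2 over the class labels reaching node t."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()        # relative frequency of each class
    return 1.0 - np.sum(p ** 2)

pure  = gini([1, 1, 1, 1])   # one class only: perfect separation, Gini 0
mixed = gini([0, 0, 1, 1])   # 50/50 mix: worst binary case, Gini 0.5
```
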
The strengths of decision tree methods are:
- Decision trees are able to generate understandable rules.
- Decision trees perform classification without requiring much computation (Fast and cheap).
- Decision trees are able to handle both continuous and categorical variables without the need to create dummy variables (Flexibility).
- Decision trees provide a clear indication of which fields are most important for prediction or classification.
The weaknesses of decision tree methods are:
- In general, decision trees do not have the same level of predictive accuracy compared to other algorithms.
- Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a continuous attribute (but this is not the case here, because I am predicting a label).
- Decision trees suffer from high variance: fitting two different trees on two random subsets of the same dataset can produce quite different results.
- Decision tree can be computationally expensive to train. The process of growing a decision tree is computationally expensive because, at each node, each candidate splitting attribute must be evaluated before its best split can be found.
- Finally, decision trees can be very non-robust: a small change in the data can cause a big change in the final tree.
Bagging
As I have said before, decision trees suffer from high variance; bagging is a general-purpose technique that aims at reducing the variance of a statistical learning method. The statistical basis behind this technique is that, given $n$ independent observations $Z_1, Z_2, \dots, Z_n$, each with variance $\sigma^2$, the variance of their mean $\bar{Z}$ is equal to $\frac{\sigma^2}{n}$.
Bagging is usually employed and particularly effective in the context of decision trees.
By pooling predictions we can incorporate much more knowledge than from any individual model/tree, since each tree brings its own background experience and information sources to the problem. In short, bagging constructs B decision trees using B bootstrapped training sets (generated from a single dataset) and then averages the resulting predictions $f^b$. Even if each tree has a high variance, combining a large number of trees in the same procedure reduces the variance. Bagging relation for prediction: $f_{{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} f^b(x)$
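scikit-learn packages this procedure as `BaggingClassifier`, whose default base learner is a decision tree; for classification the B predictions are pooled by majority vote rather than averaged. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# B=50 trees, each fitted on a bootstrap sample of the training set;
# their votes are pooled into a single prediction.
bag = BaggingClassifier(n_estimators=50, bootstrap=True,
                        random_state=0).fit(X, y)
preds = bag.predict(X)
```
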
The Random Forest algorithm introduces an improvement over bagged trees that decorrelates the trees. When the set of decision trees is built, each time a split in a tree is considered, a random sample of N predictors is chosen as split candidates from the entire set of P predictors. Therefore, each split is allowed to use only those N predictors (typically $N = \sqrt{{P}}$). A Random Forest with N = P coincides with bagging.
The novelty and great idea of Random Forest is that it forces each split node to consider only a subset of the predictors. This way the estimators, i.e. the trees of the forest, will not all be influenced by the same strong predictors of the dataset but are obliged to also consider other, less strong predictors, thus decorrelating the trees. From this rationale comes the model's name, Random Forest.
The main limitation of Random Forest is that a large number of trees can make the algorithm too slow and ineffective for real-time predictions. In general, these algorithms are fast to train but quite slow at producing predictions once trained. A more accurate prediction requires more trees, which results in a slower model.
On the other hand, thanks to its structure the Random Forest algorithm produces optimal results even with its default hyperparameters and in addition to that it is pretty resistant to overfitting.
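In scikit-learn the N-of-P sampling is controlled by `max_features`; setting it to `"sqrt"` gives the typical $N = \sqrt{P}$ rule, while `None` recovers plain bagging. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=9, random_state=0)

# Each split may inspect only sqrt(P) = 3 of the 9 predictors.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=0).fit(X, y)

# max_features=None lets every split see all P predictors: plain bagging.
bagged = RandomForestClassifier(n_estimators=100, max_features=None,
                                random_state=0).fit(X, y)
```
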
In summary:
As could be expected, the Random Forest algorithm performs better than a simple decision tree; this is absolutely reasonable, since the random forest merges the outputs of many internal decision trees, providing a better estimation and comprehension of the classification problem.
However, Random Forest is much slower than the Decision Tree in terms of computation time, especially with a high number of estimators (like 800, for example).
Surprisingly, the Random Forest-based method for feature selection has provided good results for both the Decision Tree and Random Forest models: their overall performance is slightly better, around 3-4% more. Hence, cutting the dataframe size by keeping only a subset of features has not only lightened the computational load of the program (making it faster) but also slightly improved the results.
If, for some reasons, the classification task must be made faster, this approach could be taken in high consideration.
The Support Vector Machine (SVM) bases its classification on the concept of a hyperplane: the model searches for the hyperplane that separates the data into groups in the best possible way.
Mathematical definition of a p-dimensional hyperplane: $ f(x): \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p = 0$
Given a dataset that is linearly separable, the best hyperplane among the infinite solutions is the one that maximises the margin: the margin is the smallest distance from the training samples $X_i$ of each class to the hyperplane. Hence, the optimal separating hyperplane is the one furthest from the training data points $X_i$ of each class.
Given a point $X$, it can lie:
In summary, M represents the margin of the hyperplane and we want to find the $\beta$ coefficients in order to maximise M.
The maximal margin hyperplane can be calculated by solving this optimization problem (Hard margin classifier):
$\max_{{\beta_1, \beta_2, \dots, \beta_p}}M$
such that
$\sum_{k=1}^{p} \beta_k^2 = 1$
$y_i (\beta_0 + \beta_1x_{i1} + \dots + \beta_px_{ip}) \geq M $
for each $i = 1 \dots N$
However, as I said, this solution forces the points to stay at least at distance M from the hyperplane; this could be a problem for datasets whose sample distribution does not allow this constraint.
For this reason, a soft margin classifier can be created by relaxing the constraint on the minimum distance between the points and the separating hyperplane; in other words, rather than seeking the largest possible margin which separates perfectly each observation on the correct side (and outside the 'margin region'), the model allows some training observations to be misclassified in order to do a better job in classifying all the remaining observations.
Soft margin classifier optimization problem:
$\max_{{\beta_1, \beta_2, \dots, \beta_p, \epsilon_1, \dots, \epsilon_n}}M$
such that
$\sum_{k=1}^{p} \beta_k^2 = 1$
$y_i (\beta_0 + \beta_1x_{i1} + \dots + \beta_px_{ip}) \geq M(1-\epsilon_i) $
$\epsilon_i \geq 0$, $\sum_{i=1}^{n} \epsilon_i \leq C$
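In scikit-learn's `SVC` the softness of the margin is controlled by the parameter `C`, which plays the inverse role of the budget above: a small `C` tolerates more margin violations (softer margin), while a large `C` penalises them heavily, approaching the hard-margin classifier. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

soft = SVC(kernel="linear", C=0.01).fit(X, y)    # many violations allowed
hard = SVC(kernel="linear", C=100.0).fit(X, y)   # violations heavily penalised
```
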
K-NN is a non-parametric, lazy learning algorithm. Its purpose is to use a database in which the data points are separated into several classes (they are labelled) to predict the classification of a new sample point.
Given a new sample $X$ to be classified, the model calculates the distance between the new point and all the others in the dataset, then it takes the $K$ nearest points and the new object is classified by a majority vote of its neighbors, so it is given the most common class among its k nearest neighbors.
K-NN is based on the concept of feature similarity and n-dimensional spatial distance between points.
K-NN is non-parametric, which means that it does not make any assumptions about the probability distribution of the input. This is useful for applications with input properties that are unknown and therefore makes k-NN more robust than algorithms that are parametric.
However, parametric machine learning algorithms tend to produce fewer errors than non-parametric ones, since taking input probabilities into account can influence decision making.
A key point in K-NN algorithm is the choice of the value of K:
KNN pros:
KNN cons:
TODO: is the Euclidean distance appropriate here? Is the feature space suited to this metric?
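The choice of K discussed above can be made empirically by comparing a few candidate values with cross-validation: a small K gives a flexible but noisy decision boundary, a large K a smoother but coarser one. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Mean 5-fold CV accuracy for each candidate K; keep the best one.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X, y, cv=5).mean()
          for k in (1, 5, 15)}
best_k = max(scores, key=scores.get)
```
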
I have provided both models with the normalized version of the dataset containing only numerical features; this seemed the best choice, since these two algorithms take decisions by manipulating distances.
KNN performs very well, even better than the SVM. In any case, the SVM reaches performances close to 90%, so they are extremely good as well.
The KNN results seem almost too high; maybe it's due to the fact that I have set a high value of K (K = 15), so there is probably some overfitting. However, the K-fold cross-validation should have mitigated it.
Finally, I have run the same classifiers on the PCA-transformed dataset: I obtained nearly the same results on each measure for KNN, but a small decrease (around 3-4%) for SVM. This is interesting because the dataset contained only 4 features, the 4 most important principal components obtained by PCA. So even though the dataset dimensionality has been greatly reduced, the results are still extremely good and there is no big performance drop. In this case PCA does a great job.
A common aspect of all the employed classifiers is that the two middle classes (Medium-Low and Medium-High) show worse results in terms of recall and precision: the models make more mistakes in recognizing them, while they are better at recognizing the classes at the extremes (Low and Very-High).
The classifier that works best is KNN, while the worst is the Decision Tree; SVM and Random Forest are in the middle and offer good, reliable results, as often happens.
I have decided to run each classifier with the version of the dataset I considered best suited to its characteristics and working principles, so as to make it work at its best; therefore, I compared the different classifiers even though they are trained on different versions of the same dataset.
As regards computation time (time to get the classification results), Random Forest is surely the slowest due to its high number of trees; KNN is probably a bit slower than SVM and the Decision Tree, which provide results in a few seconds.
In this analysis I have deeply examined as many characteristics and behaviors of the dataset as I could; I have explored a set of different classifiers trained on the data and I have seen how they performed. Moreover, I have tried to point out the most effective features which determine the price range of a mobile phone.
Even if the best results have been obtained by the KNN model, I would suggest the Random Forest classifier as the best one to classify a new sample into the correct price category: in general it is a much more reliable and robust algorithm than KNN and, furthermore, I suspect that strong overfitting is the reason for KNN's extremely good performance. Lastly, Random Forest has the highest precision and recall scores on the middle classes (Medium-Low and Medium-High), so it manages to perform well where the other algorithms encounter difficulties.
The features that contribute the most to the classification task are: RAM, pixel height and width (hence the screen resolution) and battery power; by analyzing only these features you can give a quite precise estimate of the price category of the corresponding mobile phone.
In conclusion, this analysis has allowed me to get my hands dirty extracting information from a dataset and applying, to a real case, the dimensionality reduction methods and classification algorithms I have studied in theory.